Abstract

This paper presents a new privacy-preserving smart metering system. Our scheme is 
private under the differential privacy model and therefore provides strong and provable 
guarantees. With our scheme, an (electricity) supplier can periodically collect data from 
smart meters and derive aggregated statistics while learning only limited information 
about the activities of individual households. For example, a supplier cannot tell from a 
user's trace when he watched TV or turned on heating. Our scheme is simple, efficient 
and practical. Processing cost is very limited: smart meters only have to add noise to 
their data and encrypt the results with an efficient stream cipher. 
1 Introduction



Several countries throughout the world are planning to deploy smart meters in households 
in the very near future. The main motivation, for governments and electricity suppliers, is 
to be able to match consumption with generation. Traditional electrical meters only measure
total consumption over a given period of time (e.g., one month or one year). As such, they do
not provide accurate information about when the energy was consumed. Smart meters, instead,
monitor and report consumption in intervals of a few minutes. They allow the utility provider to
monitor consumption almost in real time and possibly adjust generation and prices according
to the demand. Billing customers by how much is consumed and at what time of day will
probably change consumption habits and help match consumption with generation. In the
longer term, with the advent of smart appliances, it is expected that the smart grid will 
remotely control selected appliances to reduce demand. 

Problem Statement: Although smart metering might help improve energy management,
it creates many new privacy problems [I]. Smart meters provide very accurate consumption
data to electricity providers. As the sampling interval of smart meters decreases,
the ability to disaggregate a household's total load into individual appliances increases.
By analyzing high-resolution consumption data, Nonintrusive Appliance Load Monitoring (NALM) [12]
can identify a remarkable number of electric appliances (e.g., water heaters, well pumps, furnace
blowers, refrigerators, and air conditioners) employing exhaustive appliance signature libraries.
Researchers are now focusing on the myriad of small electric devices around the home such as
personal computers, laser printers, and light bulbs [16]. Moreover, it has also been shown
that even simple off-the-shelf statistical tools can be used to extract complex usage patterns
from high-resolution consumption data [17]. This extracted information can be used to profile
and monitor users for various purposes, creating serious privacy risks and concerns. As
the data recorded by smart meters becomes finer-grained, and inductive algorithms are quickly
improving, it is urgent to develop privacy-preserving smart metering systems that provide
strong and provable guarantees.

Contributions: We propose a privacy-preserving smart metering scheme that guarantees
users' privacy while still preserving the benefits and promises of smart metering. Our
contributions are manifold and summarized as follows:

• We provide the first provably private and distributed solution for smart metering that
optimizes utility without relying on a trusted third party (i.e., an aggregator). We were
able to avoid the use of a trusted third party by proposing a new distributed Laplacian
Perturbation Algorithm (DLPA).

In our scheme, smart meters are grouped into clusters, where a cluster is a group of
hundreds or thousands of smart meters corresponding, for example, to a quarter of a
city. Each smart meter sends, at each sampling period, its measurement to the supplier.
These measurements are noised and encrypted such that the supplier can compute the noised
aggregated electricity consumption of the cluster, at each sampling period, without
getting access to individual values. The aggregate is noised just enough to provide
differential privacy to each participating user, while still providing high utility (i.e., low
error). Our scheme is secure under the differential privacy model and therefore provides
strong and provable privacy guarantees. In particular, we guarantee that the supplier
can retrieve information about any user's consumption only up to a predefined threshold.
Our scheme is simple, efficient and practical. It requires either one or two rounds of
message exchanges between a meter and the supplier. Furthermore, processing cost
is very limited: smart meters only have to add noise to their data and encrypt the
results with an efficient stream cipher. Finally, our scheme is robust against smart
meter failures and malicious nodes. More specifically, it is secure even if an $\alpha$ fraction
of all nodes of a cluster collude with the supplier, where $\alpha$ is a security parameter.

• We provide a detailed analysis of the security and performance of our proposal. The
security analysis is performed analytically. The performance is evaluated by simulation
using the utility metric. We implemented a new electricity
trace generation tool based on [21] which generates one-minute resolution synthetic
consumption data of different households.

2 Related Work 

Several papers have addressed the privacy problems of smart metering in the recent past [8,
17, 1, 18, 2, 3, 20, 10]. However, only a few of them have proposed technical solutions to
protect users' privacy. In [1, 2], the authors discuss the different security aspects of smart
metering and the conflicting interests among stakeholders. The privacy of billing is considered
in [20, 17]. These techniques use zero-knowledge proofs to ensure that the fee calculated by
the user is correct without disclosing any consumption data.

Seemingly, the privacy of monitoring the sum consumption of multiple users may be
solved by simply anonymizing individual measurements as in [8] or by using some mixnet.
However, these "ad-hoc" techniques are dangerous and do not provide any real assurances of
privacy. Several prominent examples in history have shown that ad-hoc methods do not
work [14]. Moreover, these techniques require an existing trusted third party who performs
anonymization. The authors in [3] perturb the released aggregate with random noise and use
a different model from ours to analyze the privacy of their scheme. However, they do not
encrypt individual measurements, which means that the added noise must be large enough
to guarantee reasonable privacy. As individual noise shares sum up at the aggregation, the
final noise makes the aggregate useless. In contrast to this, [10] uses homomorphic encryption
to guarantee privacy for individual measurements. However, the aggregate is not perturbed,
which means that it is not differentially private.

The notion of differential privacy was first proposed in [7]. The main advantage of
differential privacy over other privacy models is that it does not specify the prior knowledge of the
adversary and provides a rigorous privacy guarantee if each user's data is statistically
independent [13]. Initial works on differential privacy focused on the problem of how a trusted curator
(aggregator), who collects all data from users, can release statistics in a differentially private way. By
contrast, our scheme ensures differential privacy even if the curator is untrusted. Although
[6] describes protocols for generating shares of random noise which are secure against malicious
participants, it requires communication between users and it uses expensive secret sharing
techniques, resulting in high overhead when the number of users is large. Similarly, traditional
Secure Multiparty Computation (SMC) techniques [11] [5] also require interactions between
users. All these solutions are impractical for resource constrained smart meters where all the
computation is done by the aggregator and users are not supposed to communicate with each
other.
Two closely related works to ours are [19] and [22]. In [19], the authors propose a scheme
to aggregate sums over multiple slots in a differentially private way when the aggregator is untrusted.
However, they use the threshold Paillier cryptosystem [9] for homomorphic encryption, which is
much more expensive than the scheme of [4] that we use. They also use a different noise distribution
technique which requires several rounds of message exchanges between the users and the
aggregator. By contrast, our solution is much simpler and more efficient: it requires only
a single message exchange if there are no node failures; otherwise, we only need one extra
round. In addition, our solution does not rely on expensive public key cryptography during
aggregation.

A recent paper [: ] proposes another technique to privately aggregate time series data.
This work differs from ours as follows: (1) they use a Diffie-Hellman-based encryption scheme,
whereas our construction is based on a more efficient construction that only uses modular
additions. This approach is better adapted to resource constrained devices like smart meters.
(2) Although [ I] does not require the establishment (and storage) of pairwise keys between
nodes as opposed to our approach, it is unclear how ['. !] can be extended to tolerate node
and communication failures. By contrast, our scheme is more robust, as the encryption key
of non-responding nodes is known to other nodes in the network that can help to recover
the aggregate. (3) Finally, [ ] uses a different noise generation method from ours, but
this technique only satisfies the relaxed $(\varepsilon, \delta)$-differential privacy definition. Indeed, in their
scheme, each node adds noise probabilistically, which means that none of the nodes add noise
with some positive probability $\delta$. Although $\delta$ can be arbitrarily small, this also decreases the
utility. By contrast, in our scheme, $\delta = 0$ while ensuring nearly optimal utility.

3 The model 

3.1 Network model 

The network is composed of four major parts: the supplier/aggregator, the electricity
distribution network, the communication network, and the users (customers). Every user is
equipped with an electricity smart meter, which measures the electricity consumption of the
user in every period of length $T_p$ and, using the communication network, sends the measurement
to the aggregator at the end of every slot (in practice, $T_p$ is around 1-30 minutes). Note that
the communication and distribution network can be the same (e.g., when PLC technology is
used to transfer data). The measurement of user $i$ in slot $t$ is denoted by $X_t^i$. The consumption
profile of user $i$ is described by the vector $(X_1^i, X_2^i, \ldots)$, where the measurements of different
users are statistically independent. Privacy directly depends on $T_p$: finer-grained samples
mean a more accurate profile, but also entail weaker privacy. The supplier is interested in the
sum of all measurements in every slot (i.e., $\sum_{i=1}^{N} X_t^i$).

As in [3], we also assume that smart meters are trusted devices (i.e., tamper-resistant)
which can store key materials and perform crypto computations. This realistic assumption
has also been confirmed in [2]. We assume that each node is configured with a private key and
gets the corresponding certificate from a trusted third party. For example, each country might
have a third party that generates these certificates and can additionally generate the "supplier"
certificates for supplier companies [2]. As in [2], we also assume that public key operations
are employed only for initial key establishment, probably when a meter is taken over by a
new supplier. Messages exchanged between the supplier and the meters are authenticated
using pairwise MACs 1. Smart meters are assumed to have a bidirectional communication
channel (using some wireless or PLC technology) with the aggregator, but the meters cannot
communicate with each other. We suppose that nodes may (randomly) fail, and in these
cases, cannot send their measurements to the aggregator. However, nodes are supposed to
use some reliable transport protocol to overcome the transient communication failures of the
channel. Finally, we note that smart meters also allow the supplier to perform fine-grained
billing based on time-dependent variable tariffs. Here, we are not concerned with the privacy
and security problems of this service. Interested readers are referred to [20, 17].

3.2 Adversary model 

In general, the objective of the adversary is to infer detailed information about household
activity (e.g., how many people are at home and what they are doing at a given time). In order
to do that, it needs to extract complex usage patterns of appliances which include the level
of power consumption, periodicity, and duration. It has been shown in [17] that different
data mining techniques can be easily applied to a raw consumption profile to obtain this
information.

In terms of its capability, we distinguish three types of adversary. The first is the
honest-but-curious (HC) adversary, who attempts to obtain private information about a user, but
follows the protocol faithfully and does not provide false information [ ']. It only uses the
(non-manipulated) collected data.

The dishonest-but-non-intrusive (DN) adversary may not follow the protocol correctly 
and is allowed to provide false information to manipulate the collected data. Some users can 
also be malicious and collude even with the supplier to collect information about honest users. 
However, the DN adversary is not allowed to access and modify the distribution network to 
mount attacks. In particular, he is not allowed to install wiretapping devices to eavesdrop on 
the victim's consumption. 

Like the DN adversary, the strongest, dishonest-and-intrusive (DI) adversary may not
follow all protocols either, but it can, in addition, invade the distribution network to gather
more information about clients. In other words, the DI adversary can monitor the electricity
consumption of the clients by installing meters on the power line that is outside of the client's
control (e.g., outside his household).

We suppose that all types of adversary can have any kind of extra knowledge about honest 
users, beyond the collected measurements, which might help to infer private information about 
them. For instance, it can observe their daily activities 2, or obtain extra information by doing
personal interviews, surveys, etc. 

3.3 Privacy model 

We use differential privacy [7] that models the adversary described above. In particular, 
differential privacy guarantees that a user's privacy should not be threatened substantially 
more if he provides his measurement to the supplier. 

Definition 1 ($\varepsilon$-differential privacy) An algorithm $\mathcal{A}$ is $\varepsilon$-differentially private if, for all
data sets $D_1$ and $D_2$ that differ in at most a single user, and for all subsets of
possible answers $S \subseteq Range(\mathcal{A})$,

$P(\mathcal{A}(D_1) \in S) \le e^{\varepsilon} \cdot P(\mathcal{A}(D_2) \in S)$

1 Please refer to [18] for a more detailed discussion about key management issues in smart metering systems.
2 Similarly to monitoring neighbors. Indeed, neighbors can also be malicious users, which is included in our
model.

Differentially private algorithms produce indistinguishable outputs for similar inputs (more
precisely, inputs differing by a single entry), and thus, the modification of any single user's data
in the dataset changes the probability of any output only up to a multiplicative factor $e^{\varepsilon}$.
The parameter $\varepsilon$ allows us to control the level of privacy. Lower values of $\varepsilon$ imply stronger
privacy, as they further restrict the influence of a user's data on the output. Note that this
model, if users' data are independent, guarantees privacy for a user even if all other users'
data is known to the adversary (e.g., it knows all measurements comprising the aggregate
except the target user's), as when $N - 1$ out of $N$ users are malicious and cooperate with
the supplier.

Example 1 (Illustration of $\varepsilon$-differential privacy) There is a dataset $D$ containing a
list of patients' entries. Each entry has an attribute that indicates whether the corresponding
patient has cancer or not. Suppose an $\varepsilon$-differentially private query $\mathcal{A}$ that returns the sanitized
number of patients in $D$ that have cancer. We assume that the adversary knows the exact
number of cancer patients, $x$, before adding Alice to $D$, and wants to learn from the random
output $O$ of $\mathcal{A}(D \cup \{Alice\})$ whether Alice has cancer or not. The adversary has no prior
knowledge about Alice (i.e., the probability that Alice has cancer is 0.5 before accessing $O$).
The adversary either infers Alice as a cancer or a non-cancer patient. The success probability
of this inference has a maximum of $\frac{1}{1+e^{-\varepsilon}}$ (and > 0.5) 3. For example, the values 2, 1, 0.5,
0.1 of $\varepsilon$ yield correct inferences with a maximum probability of 0.88, 0.73, 0.62, 0.52, resp.
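
The figures quoted in Example 1 can be reproduced with a few lines of Python. This is only an illustrative check of the bound $1/(1+e^{-\varepsilon})$; the function name `max_inference_probability` is ours and does not come from the paper.

```python
import math

def max_inference_probability(epsilon: float) -> float:
    """Upper bound 1 / (1 + e^{-epsilon}) on the adversary's success
    probability in Example 1 (uniform prior on Alice's status)."""
    return 1.0 / (1.0 + math.exp(-epsilon))

# Evaluate the bound for the epsilon values listed in the example.
for eps in (2, 1, 0.5, 0.1):
    print(eps, round(max_inference_probability(eps), 2))
```

As $\varepsilon$ approaches 0 the bound approaches 0.5, i.e., the adversary does no better than guessing.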

The definition of differential privacy also maintains a composability property: the
composition of differentially private algorithms remains differentially private and their $\varepsilon$ parameters are
accumulated. In particular, a protocol having $t$ rounds, where each round is individually
$\varepsilon$-differentially private, is itself $t \cdot \varepsilon$-differentially private.



3.4 Output perturbation: achieving differential privacy 

Let's say that we want to publish in a differentially private way the output of a function
$f$. The following theorem says that this goal can be achieved by perturbing the output of
$f$: simply adding random noise to the value of $f$, where the noise distribution is carefully
calibrated to the global sensitivity of $f$, results in $\varepsilon$-differential privacy. The global sensitivity
of a function is the maximum "change" in the value of the function when its input differs in
a single entry. For instance, if $f$ is the sum of all its inputs, the sensitivity is the maximum
value that an input can take.

Theorem 1 (Laplacian Perturbation Algorithm (LPA) [7]) For all $f : \mathbb{D} \to \mathbb{R}^r$, the
following mechanism $\mathcal{A}$ is $\varepsilon$-differentially private: $\mathcal{A}(D) = f(D) + \mathcal{L}(S(f)/\varepsilon)$, where $\mathcal{L}(S(f)/\varepsilon)$
is an independently generated random variable following the Laplace distribution and $S(f)$
denotes the global sensitivity of $f$ 4.

3 Let $A$ denote the event that Alice has cancer. Using Bayesian reasoning,
$P(A|O) = \frac{P(O|A)}{P(O|A) + P(O|\bar{A})} = \frac{P(\mathcal{A}(x+1)=O)}{P(\mathcal{A}(x+1)=O) + P(\mathcal{A}(x)=O)} \le \frac{1}{1+e^{-\varepsilon}}$,
where we used that $P(A) = P(\bar{A})$ and $e^{-\varepsilon} \le \frac{P(\mathcal{A}(x+1)=O)}{P(\mathcal{A}(x)=O)} \le e^{\varepsilon}$. Moreover,
the optimal inference strategy is the maximum likelihood decision: the adversary infers Alice as a cancer
patient if $P(\mathcal{A}(x+1) = O) > P(\mathcal{A}(x) = O)$, or with probability 0.5 if $P(\mathcal{A}(x+1) = O) = P(\mathcal{A}(x) = O)$,
otherwise as a non-cancer patient.
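
As a minimal sketch of the Laplacian perturbation described by Theorem 1, the snippet below releases a noisy sum. It assumes a sum query whose sensitivity equals the largest value a single input can take (as stated in the text); the helper names `laplace_noise` and `lpa_sum` and the sample readings are hypothetical, and the Laplace variate is drawn as the difference of two exponentials using only the standard library.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw a Laplace(0, scale) variate as the difference of two
    i.i.d. exponential variates with mean `scale`."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def lpa_sum(values, max_value: float, epsilon: float, rng: random.Random) -> float:
    """Release sum(values) plus Laplace noise of scale S(f)/epsilon,
    where S(f) = max_value for a sum query (assumption of this sketch)."""
    sensitivity = max_value
    return sum(values) + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)           # fixed seed only to make the sketch repeatable
readings = [300, 100, 50]        # hypothetical per-user measurements in one slot
noisy_total = lpa_sum(readings, max_value=400, epsilon=0.5, rng=rng)
print(noisy_total)
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy at the cost of a less accurate released sum.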

Example 2 To illustrate these definitions, consider a mini smart metering application, where
users $U_1$, $U_2$, and $U_3$ need to send the sum of their measurements in two consecutive slots.
The measurements of $U_1$, $U_2$ and $U_3$ are $(X_1^1 = 300, X_2^1 = 300)$, $(X_1^2 = 100, X_2^2 = 400)$, and
$(X_1^3 = 50, X_2^3 = 150)$, resp. The nodes want differential privacy for the released sums with
at least $\varepsilon = 0.5$. Based on Theorem 1, they need to add $\mathcal{L}(\lambda = \max_i \sum_t X_t^i / 0.5 = 1200)$
noise to the released sum in each slot. This noise ensures $\varepsilon = \sum_t X_t^1 / \lambda = 0.5$ individual
indistinguishability for $U_1$, $\varepsilon = 0.42$ for $U_2$, and $\varepsilon = 0.17$ for $U_3$. Hence, the global $\varepsilon = 0.5$
bound is guaranteed to all. Another interpretation is that $U_1$ has $\varepsilon_1 = X_1^1/\lambda = 0.25$, $\varepsilon_2 =
X_2^1/\lambda = 0.25$ privacy in each individual slot, and $\varepsilon = \varepsilon_1 + \varepsilon_2 = 0.5$ considering both slots,
which follows from the composition property of differential privacy.
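
The arithmetic of Example 2 can be replayed directly. The sketch below only recomputes the numbers given in the example; the variable names are ours.

```python
# Measurements of the three users in the two slots, as given in Example 2.
measurements = {
    "U1": [300, 300],
    "U2": [100, 400],
    "U3": [50, 150],
}
epsilon = 0.5

# Noise scale: largest per-user total over the two slots, divided by epsilon.
lam = max(sum(slots) for slots in measurements.values()) / epsilon

# Effective privacy level each user obtains with that noise scale.
per_user_eps = {user: sum(slots) / lam for user, slots in measurements.items()}

# Per-slot view for U1; the two values add up by composition.
per_slot_eps_u1 = [x / lam for x in measurements["U1"]]

print(lam, per_user_eps, per_slot_eps_u1)
```

The user with the largest total consumption (here $U_1$) determines the noise scale, so every other user obtains a strictly stronger guarantee.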

3.5 Utility definition 

Let $f : \mathbb{D} \to \mathbb{R}$. In order to measure the utility, we quantify the difference between
$f(D)$ and its perturbed value (i.e., $\hat{f}(D) = f(D) + \mathcal{L}(\lambda)$), which is the error introduced by
LPA. A common scale-dependent error measure is the Mean Absolute Error (MAE), which
is $E|f(D) - \hat{f}(D)|$ in our case. However, the error should be dependent on the non-perturbed
value of $f(D)$; if $f(D)$ is greater, the added noise becomes small compared to $f(D)$, which
intuitively results in better utility. Hence, we rather use a slightly modified version of a
scale-independent metric called Mean Absolute Percentage Error (MAPE), which shows the
proportion of the error to the data, as follows.

Definition 2 (Error function) Let $D_t \in \mathbb{D}$ denote a dataset in time-slot $t$. Furthermore,
let $\delta_t = \frac{|f(D_t) - \hat{f}(D_t)|}{f(D_t) + 1}$ (i.e., the value of the error in slot $t$). The error function is defined as
$\mu(t) = E(\delta_t)$. The expectation is taken on the randomness of $\hat{f}(D_t)$. The standard deviation
of the error is $\sigma(t) = \sqrt{Var(\delta_t)}$ in time $t$.

In the rest of this paper, the terms "utility" and "error" are used interchangeably. 

4 Objectives 

Our goal is to develop a practical scheme that should not introduce more privacy risks for 
users than traditional metering systems while retaining the benefits of smart meters. More 
specifically, the scheme should be 

• differentially private: Considering the DN adversary, the scheme releases sanitized
aggregates $\hat{X}_t$ in a differentially private way, where the leaked information about users is measured
by $\varepsilon$.

• robust and easily configurable: It tolerates (random) node failures.

• efficient: It has low overhead, which includes low computation load on smart meters and
low communication overhead between the supplier and individual meters. It should use
public key operations only for initial key establishment. Afterwards, all communication
is protected using more efficient symmetric crypto-based techniques.

• distributed: Besides a certificate authority, the protocol does not require any trusted
third party such as a trusted aggregator as in [3]. The smart meters communicate
directly with the supplier.

• useful for the supplier: The sanitized and the original (non-sanitized) aggregate should
be "similar" (i.e., the error should be as small as possible). For instance, the supplier
should be able to perform efficient management of the resource using the sanitized data:
to monitor the consumption at a granularity of at most a few hundred households,
and to detect consumption peaks or abnormal consumption.

4 Formally, let $f : \mathbb{D} \to \mathbb{R}^r$; then the global sensitivity of $f$ is $S(f) = \max \|f(D_1) - f(D_2)\|_1$, where $D_1$ and
$D_2$ differ in a single entry and $\|\cdot\|_1$ denotes the $L_1$ distance.



5 Overview of approaches 



Our task is to enable the supplier to calculate the sum of at most $N$ measurements (i.e.,
$\sum_{i=1}^{N} X_t^i = X_t$ in all $t$) coming from $N$ different users while ensuring $\varepsilon$-differential privacy for
each user. This is guaranteed if the supplier can only access $X_t + \mathcal{L}(\lambda(t))$, where $\mathcal{L}(\lambda(t))$ 5 is
the Laplace noise calibrated to $\varepsilon$ as described in Section 3.4. There are (at least)
3 possible approaches to do this, which are detailed as follows.



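[Figure 1, panel (a): nodes 1, ..., N send $Enc(X_t^1)$, $Enc(X_t^2)$, ..., $Enc(X_t^N)$ to the aggregator, which
forwards $\sum_i Enc(X_t^i) + Enc(\mathcal{L}(\lambda))$ to the supplier; the supplier decrypts it to obtain $X_t + \mathcal{L}(\lambda)$.
Panel (b): nodes 1, ..., N send $Enc(X_t^1 + \sigma_1)$, $Enc(X_t^2 + \sigma_2)$, ..., $Enc(X_t^N + \sigma_N)$ to the
supplier/aggregator, which computes $Dec(\sum_i Enc(X_t^i + \sigma_i)) = X_t + \mathcal{L}(\lambda)$.]

Figure 1: Aggregating measurements while guaranteeing differential privacy. (a) Centralized approach:
aggregation with trusted aggregator. (b) Our approach: aggregation without trusted entity. If
$\sigma_i = \mathcal{G}_1(N, \lambda) - \mathcal{G}_2(N, \lambda)$, where $\mathcal{G}_1$, $\mathcal{G}_2$ are i.i.d. gamma noise, then $\sum_{i=1}^{N} \sigma_i = \mathcal{L}(\lambda)$.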



5.1 Fully decentralized approach (without aggregator) 

Our first attempt is that each user adds some noise to its own measurement, where the
noise is drawn from a Laplace distribution. In particular, each node $i$ sends the value of
$X_t^i + \mathcal{L}(\lambda)$ directly to the supplier in time $t$. It is easy to see that $\varepsilon$ is guaranteed to all users,
but in fact the final noise added to the aggregate (i.e., $\sum_{i=1}^{N} \mathcal{L}(\lambda)$) is $N$ times larger than
$\mathcal{L}(\lambda)$, and hence, the error is $\mu(t) = \frac{1}{X_t + 1} E\left|\sum_{i=1}^{N} \mathcal{L}(\lambda)\right|$.

5 We will use the notation $\lambda$ instead of $\lambda(t)$ if the dependency on time is obvious in the context.
5.2 Aggregation with a trusted aggregator

Our second attempt can be to aggregate the measurements of some users, and send the
perturbed aggregate to the supplier. In particular, nodes are grouped into clusters of size $N$
and each node of a cluster sends its measurement $X_t^i$ to the (trusted) cluster aggregator, which
is a trusted entity different from the supplier. The aggregator computes $X_t = \sum_{i=1}^{N} X_t^i$ and
obtains $\hat{X}_t = X_t + \mathcal{L}(\lambda)$ by adding noise to the aggregate. This perturbed aggregate is then
sent to the supplier as illustrated in Figure 1(a).

The utility of this approach is better than in the previous case, as the noise is only added to
the sum and not to each measurement $X_t^i$. Formally, $\mu(t) = \frac{1}{X_t + 1} E|\mathcal{L}(\lambda)| = \frac{\lambda}{X_t + 1}$. Similarly,
$\sigma(t) = \frac{1}{X_t + 1} \cdot \sqrt{E|\mathcal{L}(\lambda)|^2 - (E|\mathcal{L}(\lambda)|)^2} = \frac{\lambda}{X_t + 1}$.
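
For readers who want the intermediate step, the two moments of $|\mathcal{L}(\lambda)|$ that enter the error and its standard deviation follow from the Laplace density $\frac{1}{2\lambda} e^{-|x|/\lambda}$. The derivation below is a standard computation added as a worked step, not part of the original argument:

```latex
E|\mathcal{L}(\lambda)|
  = \int_{-\infty}^{\infty} |x| \, \frac{1}{2\lambda} e^{-|x|/\lambda} \, dx
  = \frac{1}{\lambda} \int_{0}^{\infty} x \, e^{-x/\lambda} \, dx
  = \lambda ,
\qquad
E|\mathcal{L}(\lambda)|^{2}
  = \frac{1}{\lambda} \int_{0}^{\infty} x^{2} e^{-x/\lambda} \, dx
  = 2\lambda^{2} ,
\qquad
\sqrt{E|\mathcal{L}(\lambda)|^{2} - \left(E|\mathcal{L}(\lambda)|\right)^{2}}
  = \sqrt{2\lambda^{2} - \lambda^{2}}
  = \lambda .
```

Dividing both quantities by $X_t + 1$ gives the same value, $\lambda/(X_t+1)$, for the mean and the standard deviation of the error.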

However, the main drawback of this approach is that the aggregator must be fully trusted 
since it receives each individual measurement from the users. This can make this scheme 
impractical if there is no such trusted entity. 

5.3 Our approach: aggregation without trusted entity 

Although the previous scheme is differentially private, it works only if the aggregator is
trustworthy and faithfully adds the noise to the measurement. In particular, the scheme will
not be secure if the aggregator omits to add the noise.

Our scheme, instead, does not rely on any centralized aggregator. The noise is added by
each smart meter to its individual data, which is encrypted in such a way that the aggregator
can only compute the (noisy) aggregate. Note that with our approach the aggregator and
the supplier do not need to be separate entities. The supplier can even play the role of the
aggregator, as the encryption prevents it from accessing individual measurements, and the distributed
generation of the noise ensures that it cannot manipulate the noise.

Our proposal is composed of 2 main steps: distributed generation of the Laplacian noise 
and encryption of individual measurements. These 2 steps are described in the remainder of 
this section. 

5.3.1 Distributed noise generation: a new approach 

In our proposal, the Laplacian noise is generated in a fully distributed way as illustrated
in Figure 1(b). We use the following lemma, which states that the Laplace distribution is
divisible and can be constructed as the sum of i.i.d. gamma distributions. As this divisibility is
infinite, it works for an arbitrary number of users.

Lemma 1 (Divisibility of Laplace distribution [ ]) Let $\mathcal{L}(\lambda)$ denote a random variable
which has a Laplace distribution with PDF $f(x, \lambda) = \frac{1}{2\lambda} e^{-\frac{|x|}{\lambda}}$. Then the distribution of $\mathcal{L}(\lambda)$
is infinitely divisible. Furthermore, for every integer $n \ge 1$, $\mathcal{L}(\lambda) = \sum_{i=1}^{n} [\mathcal{G}_1(n, \lambda) - \mathcal{G}_2(n, \lambda)]$,
where $\mathcal{G}_1(n, \lambda)$ and $\mathcal{G}_2(n, \lambda)$ are i.i.d. random variables having gamma distribution with PDF
$g(x, n, \lambda) = \frac{(1/\lambda)^{1/n}}{\Gamma(1/n)} x^{\frac{1}{n} - 1} e^{-x/\lambda}$ where $x \ge 0$.

The lemma comes from the fact that $\mathcal{L}(\lambda)$ can be represented as the difference of two
i.i.d. exponential random variables with rate parameter $1/\lambda$. Moreover, $\sum_{i=1}^{n} \mathcal{G}_1(n, \lambda) -
\sum_{i=1}^{n} \mathcal{G}_2(n, \lambda) = \mathcal{G}_1(1/\sum_{i=1}^{n} \frac{1}{n}, \lambda) - \mathcal{G}_2(1/\sum_{i=1}^{n} \frac{1}{n}, \lambda) = \mathcal{G}_1(1, \lambda) - \mathcal{G}_2(1, \lambda)$ due to the
summation property of the gamma distribution 6. Here, $\mathcal{G}_1(1, \lambda)$ and $\mathcal{G}_2(1, \lambda)$ are i.i.d. exponential
random variables with rate parameter $1/\lambda$, which completes the argument.

Our distributed sanitization algorithm is simple: user $i$ calculates the value $\hat{X}_t^i = X_t^i +
\mathcal{G}_1(N, \lambda) - \mathcal{G}_2(N, \lambda)$ in slot $t$ and sends it to the aggregator, where $\mathcal{G}_1(N, \lambda)$ and $\mathcal{G}_2(N, \lambda)$
denote two random values independently drawn from the same gamma distribution. Now, if
the aggregator sums up all values received from the $N$ users of a cluster, then $\sum_{i=1}^{N} \hat{X}_t^i =
\sum_{i=1}^{N} X_t^i + \sum_{i=1}^{N} [\mathcal{G}_1(N, \lambda) - \mathcal{G}_2(N, \lambda)] = X_t + \mathcal{L}(\lambda)$ based on Lemma 1.
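
The distributed sanitization step can be simulated with the standard library alone. The sketch below is an illustration under stated assumptions, not the authors' implementation: each meter draws its share as the difference of two gamma variates with shape $1/N$ and scale $\lambda$ (matching the PDF in Lemma 1), and the mean absolute error of the aggregate is estimated empirically, which should come out close to $\lambda$. The function names, the cluster size of 20, and the constant readings are hypothetical.

```python
import random

def noise_share(n_users: int, lam: float, rng: random.Random) -> float:
    """One meter's share G1(N, lam) - G2(N, lam): two independent gamma
    variates with shape 1/N and scale lam."""
    shape = 1.0 / n_users
    return rng.gammavariate(shape, lam) - rng.gammavariate(shape, lam)

def noisy_aggregate(readings, lam: float, rng: random.Random) -> float:
    """Each meter perturbs its own reading; the aggregator only sums."""
    n = len(readings)
    sanitized = [x + noise_share(n, lam, rng) for x in readings]
    return sum(sanitized)

rng = random.Random(42)          # fixed seed only to make the sketch repeatable
readings = [10.0] * 20           # hypothetical cluster of N = 20 meters
lam = 5.0

# Empirical mean absolute error of the aggregate over many simulated slots.
errors = [abs(noisy_aggregate(readings, lam, rng) - sum(readings))
          for _ in range(5000)]
mean_abs_error = sum(errors) / len(errors)
print(mean_abs_error)            # expected to be close to lam
```

Each individual share has a small variance when $N$ is large, which is why the sanitized individual values still need to be encrypted.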

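The utility of our distributed scheme is defined as $\mu(t) = \frac{1}{X_t + 1} E\left|X_t - \left(X_t + \sum_{i=1}^{N} [\mathcal{G}_1(N, \lambda) -
\mathcal{G}_2(N, \lambda)]\right)\right| = \frac{E|\mathcal{L}(\lambda)|}{X_t + 1} = \frac{\lambda}{X_t + 1}$ and $\sigma(t) = \frac{\lambda}{X_t + 1}$.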



5.3.2 Encryption 

The previous step is not enough to guarantee privacy, as only the sum of the measurements
(i.e., $\hat{X}_t$) is differentially private but not the individual measurements. In particular, the
aggregator has access to $\hat{X}_t^i$, and even if $\hat{X}_t^i$ is noisy, $\mathcal{G}_1(N, \lambda) - \mathcal{G}_2(N, \lambda)$ is usually insufficient
to provide reasonable privacy for individual users if $N \gg 1$. This is illustrated in Figure 2,
where an individual's noisy and original measurements differ only slightly.




Figure 2: The original and noisy measurements of user $i$, where the added noise is $\mathcal{G}_1(N, \lambda) -
\mathcal{G}_2(N, \lambda)$ ($N = 100$, $T_p$ is 10 min). (a) $X_t^i$. (b) $X_t^i + \mathcal{G}_1(N, \lambda) - \mathcal{G}_2(N, \lambda)$.

To address this problem, each contribution is encrypted using a modulo addition-based
encryption scheme, inspired by [4], such that the aggregator can only decrypt the sum of
the individual values, and cannot access any of them. In particular, let $k_i$ denote a random
key generated by user $i$ inside a cluster such that $\sum_{i=1}^{N} k_i = 0$, and $k_i$ is not known to
the aggregator. Furthermore, $Enc()$ denotes a probabilistic encryption scheme such that
$Enc(p, k, m) = p + k \bmod m$, where $p$ is the plaintext, $k$ is the encryption key, and $m$ is a
large integer. The adversary cannot decrypt any $Enc(\hat{X}_t^i, k_i, m)$, since it does not know $k_i$,
but it can easily retrieve the noisy sum by adding the encrypted noisy measurements of all
users: $\sum_{i=1}^{N} Enc(\hat{X}_t^i, k_i, m) = \sum_{i=1}^{N} \hat{X}_t^i + \sum_{i=1}^{N} k_i = \sum_{i=1}^{N} \hat{X}_t^i \bmod m$. If $z = \max_{i,t}(\hat{X}_t^i)$ then
$m$ should be selected as $m = 2^{\lceil \log_2(z \cdot N) \rceil}$ [ :]. The generation of $k_i$ is described in Section 6.2.
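
The modulo addition-based encryption can be illustrated as follows. This is a simplified sketch: it assumes non-negative integer plaintexts, and it generates the zero-sum keys in one place purely for demonstration, whereas the scheme derives them in a distributed way. The helper names (`modulus_for`, `zero_sum_keys`, `enc`) and the sample values are hypothetical.

```python
import math
import secrets

def modulus_for(max_value: int, n_users: int) -> int:
    """m = 2^ceil(log2(z * N)), following the selection rule in the text."""
    return 2 ** math.ceil(math.log2(max_value * n_users))

def zero_sum_keys(n_users: int, m: int):
    """Random keys k_1..k_N with k_1 + ... + k_N = 0 (mod m).
    Generated centrally here for illustration only."""
    keys = [secrets.randbelow(m) for _ in range(n_users - 1)]
    keys.append((-sum(keys)) % m)
    return keys

def enc(p: int, k: int, m: int) -> int:
    """Enc(p, k, m) = p + k mod m."""
    return (p + k) % m

noisy_readings = [120, 75, 310, 42]      # hypothetical sanitized values
m = modulus_for(max_value=400, n_users=len(noisy_readings))
keys = zero_sum_keys(len(noisy_readings), m)
ciphertexts = [enc(x, k, m) for x, k in zip(noisy_readings, keys)]

# The keys cancel in the sum, so the aggregator recovers only the total.
recovered_sum = sum(ciphertexts) % m
print(recovered_sum)
```

A single ciphertext is uniformly distributed modulo `m` to anyone who does not know the corresponding key, which is what hides the individual values.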

6 The sum of independent gamma random variables with a common scale follows a gamma distribution
(i.e., $\sum_{i=1}^{n} \mathcal{G}(k_i, \lambda) = \mathcal{G}(1/\sum_{i=1}^{n} \frac{1}{k_i}, \lambda)$).
6 Protocol description

6.1 System setup

In our scheme, nodes are grouped into clusters of size $N$, where $N$ is a parameter. The
protocol requires the establishment of pairwise keys between each pair of nodes inside a cluster,
which can be done by using traditional Diffie-Hellman key exchange as follows. When a node $v_i$
is installed, it provides a self-signed DH component and its certificate to the supplier. Once
all the nodes of a cluster are installed, or a new node is deployed, the supplier broadcasts the
certificates and public DH components of all nodes. Finally, each node $v_i$ of the cluster can
compute a pairwise key $K_{i,j}$ shared with any other node $v_j$ in the network. Note that no
communication is required between $v_i$ and $v_j$.

6.2 Smart meter processing 

Each node Vi sends at time t its periodic measurement, X\, to the supplier as follows: 

Phase 1 (Data sanitization): Node «j calculates value X\ = X\ + Gi(N, A) — Q2{N,\), 
where Q\{N, A) and G2(X, A) denote two random values independently drawn from the 
same gamma distribution and N is the cluster size. 

Phase 2 (Data encryption): Each noisy data X\ is then encrypted into Enc(X\) using 
the modulo addition-based encryption scheme detailed in Section 5.3.2. The following 
extension is then applied to generate the encryption keys: Each node, Vi, selects I other 
nodes randomly, such that if Vi selects Vj, then Vj also selects «j. Afterwards, both nodes 
generate a common dummy key k from their pairwise key K{ Uj adds k to Enc(Xl) 
and Vj adds — k to Enc(X^). As a result, the aggregator cannot decrypt the individual 
ciphertexts (it does not know the dummy key k). However, it adds all the ciphertexts 
of a given cluster, the dummy keys cancel out and it retrieves the encrypted sum of the 
(noisy) contributions. The more formal description is as follows: 

1. Node $v_i$ selects some nodes of the cluster randomly (we call them participating
nodes) using a secure pseudo random function (PRF) such that if $v_i$ selects $v_j$,
then $v_j$ also selects $v_i$. In particular, $v_i$ selects $v_j$ if mapping
$\mathrm{PRF}(K_{i,j}, r_1)$ to a value between 0 and 1 gives a value less than or equal to
$\frac{w}{N-1}$, where $r_1$ is a public value changing in each slot. We denote by $\ell$ the
number of selected participating nodes, and $\mathrm{ind}_i[j]$ (for $j = 1, \ldots, \ell$) denotes
the index of the $\ell$ nodes selected by node $v_i$. Note that, for the supplier, the probability
that $v_i$ selects $v_j$ is $\frac{w}{N-1}$ as it does not know $K_{i,j}$.
The expected value of $\ell$ is $w$.

2. $v_i$ computes a dummy key for each of its $\ell$ participating nodes. A dummy key
between $v_i$ and $v_j$ is defined as
$\mathrm{dkey}_{i,j} = \frac{i-j}{|i-j|} \cdot \mathrm{PRF}(K_{i,j}, r_2)$, where $K_{i,j}$
is the key shared by $v_i$ and $v_j$, and $r_2 \ne r_1$ is a public value changing in each slot.
Note that $\mathrm{dkey}_{i,j} = -\mathrm{dkey}_{j,i}$.

3. $v_i$ then computes
$\mathrm{Enc}(\hat{X}_t^i) = \hat{X}_t^i + K'_i + \sum_{j=1}^{\ell} \mathrm{dkey}_{i,\mathrm{ind}_i[j]} \pmod{m}$,
where $K'_i$ is the keystream shared by $v_i$ and the aggregator, which can be established using the
DH protocol as above, and $m$ is a large integer (see [4]). Note that $m$ must be
larger than the sum of all contributions (i.e., final aggregate) plus the Laplacian
noise (footnote 7).

Note that $\hat{X}_t^i$ is encrypted multiple times: it is first encrypted with the keystream
$K'_i$ and then with several dummy keys. $K'_i$ is needed to ensure confidentiality
between a user and the aggregator. The dummy keys are needed to prevent the
aggregator (supplier) from retrieving $\hat{X}_t^i$.

4. $\mathrm{Enc}(\hat{X}_t^i)$ is sent to the aggregator (supplier).
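
The node-side steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the PRF is instantiated with HMAC-SHA256, the modulus is fixed to $m = 2^{32}$, and the function names (`prf`, `selects`, `dummy_key`, `encrypt`) are hypothetical choices made for this example.

```python
import hashlib
import hmac

M_MOD = 2 ** 32  # modulus m (assumed 32-bit for this sketch)


def prf(key: bytes, label: bytes) -> int:
    """PRF instantiated with HMAC-SHA256 (an assumption); returns a 256-bit integer."""
    return int.from_bytes(hmac.new(key, label, hashlib.sha256).digest(), "big")


def selects(k_ij: bytes, r1: bytes, w: int, n: int) -> bool:
    """v_i and v_j select each other when PRF(K_ij, r1), mapped to [0, 1], is at most w/(N-1).

    Both nodes evaluate the same function on the same inputs, so the choice is symmetric.
    """
    return prf(k_ij, r1) / 2 ** 256 <= w / (n - 1)


def dummy_key(i: int, j: int, k_ij: bytes, r2: bytes) -> int:
    """dkey_{i,j} = sign(i - j) * PRF(K_ij, r2), reduced mod m."""
    sign = 1 if i > j else -1
    return (sign * prf(k_ij, r2)) % M_MOD


def encrypt(i, noisy_value, keystream, pair_keys, r1, r2, w, n):
    """Enc = noisy value + keystream K'_i + dummy keys of the selected nodes (mod m).

    pair_keys maps the index j of every other cluster node to the pairwise key K_ij.
    """
    total = noisy_value + keystream
    for j, k_ij in pair_keys.items():
        if selects(k_ij, r1, w, n):
            total += dummy_key(i, j, k_ij, r2)
    return total % M_MOD
```

Because the selection test and the dummy key depend only on $K_{i,j}$ and the public slot values, two nodes reach the same decision and opposite-sign keys without exchanging any message.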
6.3 Supplier processing 

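Phase 1 (Data aggregation): At each slot, the supplier aggregates the $N$ measurements
received from the cluster smart meters by summing them, and obtains
$\sum_{i=1}^{N} \mathrm{Enc}(\hat{X}_t^i)$. In particular,

\[ \sum_{i=1}^{N} \mathrm{Enc}(\hat{X}_t^i) = \sum_{i=1}^{N} (\hat{X}_t^i + K'_i) + \sum_{i=1}^{N} \sum_{j=1}^{\ell} \mathrm{dkey}_{i,\mathrm{ind}_i[j]} \pmod{m} \]

where $\sum_{i=1}^{N} \sum_{j=1}^{\ell} \mathrm{dkey}_{i,\mathrm{ind}_i[j]} = 0$ because
$\mathrm{dkey}_{i,j} = -\mathrm{dkey}_{j,i}$. Hence,

\[ \sum_{i=1}^{N} \mathrm{Enc}(\hat{X}_t^i) = \sum_{i=1}^{N} (\hat{X}_t^i + K'_i) \pmod{m} \]

Phase 2 (Data decryption): The aggregator then decrypts the aggregated value by subtracting
the sum of the nodes' keystreams, and retrieves the sum of the noisy measurements:

\[ \sum_{i=1}^{N} \mathrm{Enc}(\hat{X}_t^i) - \sum_{i=1}^{N} K'_i = \sum_{i=1}^{N} \hat{X}_t^i \pmod{m} \]

where $\sum_{i=1}^{N} \hat{X}_t^i = \sum_{i=1}^{N} X_t^i + \sum_{i=1}^{N} \mathcal{G}_1(N, \lambda) - \sum_{i=1}^{N} \mathcal{G}_2(N, \lambda) = \sum_{i=1}^{N} X_t^i + \mathcal{L}(\lambda)$
based on Lemma 1.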

The main idea of the scheme is that the aggregator is not able to decrypt the individual
encrypted values because it does not know the dummy keys. However, when the different
encrypted contributions are added, the dummy keys cancel each other and the aggregator can
retrieve the sum of the plaintexts. The resulting plaintext is then the perturbed sum of the
measurements, where the noise ensures the differential privacy of each user.
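
A small simulation can illustrate the cancellation. It is a sketch under simplifying assumptions: keystreams and dummy keys are drawn from Python's non-cryptographic `random` module instead of being derived from DH keys with a PRF, each pair of nodes selects each other with probability 1/2, and `simulate_round` is a hypothetical helper name.

```python
import random

M_MOD = 2 ** 32  # modulus m (assumed)


def simulate_round(noisy_values, seed=0):
    """Encrypt every noisy value, then aggregate and decrypt as the supplier would."""
    rng = random.Random(seed)
    n = len(noisy_values)
    keystreams = [rng.randrange(M_MOD) for _ in range(n)]  # K'_i, known to the supplier

    # Antisymmetric dummy keys: dkey[i][j] = -dkey[j][i], zero if the pair is not selected.
    dkey = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < 0.5:
                k = rng.randrange(M_MOD)
                dkey[i][j] = k
                dkey[j][i] = -k

    ciphertexts = [
        (noisy_values[i] + keystreams[i] + sum(dkey[i])) % M_MOD for i in range(n)
    ]

    # Supplier side: the dummy keys vanish in the sum, the keystreams are subtracted.
    aggregate = sum(ciphertexts) % M_MOD
    decrypted_sum = (aggregate - sum(keystreams)) % M_MOD
    return ciphertexts, decrypted_sum


values = [120, 75, 300, 42]
ciphertexts, total = simulate_round(values)
```

Here `total` equals the sum of the inputs, while a single ciphertext cannot be opened with the keystream alone whenever the node holds at least one dummy key unknown to the supplier.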

Complexity: Let $b$ denote the size of the pairwise keys (i.e., $K_{i,j}$). Our scheme has
$O(N \cdot b)$ storage complexity, as each node needs to store $\ell < N$ pairwise keys. The
computational overhead is dominated by the encryption and the key generation complexity. The
encryption is composed of $\ell < N$ modular additions of $\log_2 m$-bit integers, while the key
generation needs the same number of PRF executions. This results in a complexity of
$O(N \cdot (\log_2 m + c(b)))$, where $c(b)$ is the complexity of the applied PRF (footnote 8).

7 Note that the noise is a random value from an infinite domain, so this sum might be larger than $m$.
However, by choosing $m$ sufficiently large, the probability that the sum exceeds $m$ can be made
arbitrarily small due to the exponential tail of the Laplace distribution.

8 For instance, if $\log_2 m = 32$ bits (which should be sufficient in our application), $b = 128$, and $N = 1000$,
a node needs to store 16 KB of key data and perform at most 1000 additions along with 1000 subtractions

7 Adding robustness



We have assumed so far that all the $N$ nodes of a cluster participate in the protocol.
However, it might happen that, for several different reasons (e.g., node or communication
failures), some nodes are not able to participate in each epoch. This would have two effects:
first, security will be reduced since the sum of the noise added by each node will not be
equivalent to $\mathcal{L}(\lambda)$. Hence, differential privacy may not be guaranteed. Second,
the aggregator will not be able to decrypt the aggregated value since the dummy keys will not
cancel out.

In this section, we extend our scheme to resist node failures. We propose a scheme which 
resists the failure of up to M out of N nodes, where M is a configuration parameter. We will 
study later the impact of the value M on the scheme performance. 

7.1 Sanitization phase extension 

In order to resist the failure of $M$ nodes, each node should add the following noise to its
individual measurement: $\mathcal{G}_1(N - M, \lambda) - \mathcal{G}_2(N - M, \lambda)$. Note that
$\sum_{i=1}^{N-M} [\mathcal{G}_1(N - M, \lambda) - \mathcal{G}_2(N - M, \lambda)] = \mathcal{L}(\lambda)$.
Therefore, this sanitization algorithm remains differentially private if
at least $N - M$ nodes participate in the protocol. Note that in that case each node adds extra
noise to the aggregate in order to ensure differential privacy even if up to $M$ nodes fail
to send their noise share to the aggregator.
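
The following sketch checks this calibration empirically. It assumes that $\mathcal{G}(n, \lambda)$ denotes a gamma distribution with shape $1/n$ and scale $\lambda$ (the parametrization implied by the moments used in Appendix A), and it relies on the fact that a Laplace variable $\mathcal{L}(\lambda)$ has mean absolute value $\lambda$. The helper names and the parameter values are illustrative only.

```python
import random


def noise_share(n_eff: int, lam: float, rng: random.Random) -> float:
    """One node's share G1(n_eff, lam) - G2(n_eff, lam): gamma with shape 1/n_eff, scale lam."""
    shape = 1.0 / n_eff
    return rng.gammavariate(shape, lam) - rng.gammavariate(shape, lam)


def aggregate_noise(participants: int, n_eff: int, lam: float, rng: random.Random) -> float:
    """Total noise in the aggregate when `participants` nodes each add one share."""
    return sum(noise_share(n_eff, lam, rng) for _ in range(participants))


rng = random.Random(1)
N, M, lam = 20, 5, 5.0

# Exactly N - M participants: the shares add up to Laplace noise with scale lam.
samples = [aggregate_noise(N - M, N - M, lam, rng) for _ in range(5000)]
mean_abs_min = sum(abs(x) for x in samples) / len(samples)

# All N participants: the aggregate carries more noise than strictly necessary.
samples_all = [aggregate_noise(N, N - M, lam, rng) for _ in range(5000)]
mean_abs_all = sum(abs(x) for x in samples_all) / len(samples_all)
```

With these settings `mean_abs_min` should come out close to `lam`, and `mean_abs_all` should be larger, which is the utility cost of tolerating failures.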

7.2 Encryption phase extension 

7.2.1 A simple approach 

As described previously, all the dummy keys cancel out at the aggregator. However, this
is not the case if not all the nodes participate in the protocol. In order to resist the failure of
nodes, one can extend the encryption scheme with an additional round where the aggregator
asks the participating nodes of non-responding nodes for the missing dummy keys:

1. Once the aggregator has received all contributions, it broadcasts the ids of the
non-responding nodes.

2. Upon the reception of this message, each node $v_i$ verifies whether any of the ids in the
broadcast message are in its participating node list (i.e., whether they can be found in
$\mathrm{ind}_i$). For each such id, the node sends the corresponding dummy key to the aggregator.

3. The aggregator then subtracts all received dummy keys from $\sum_{i=1}^{N} \mathrm{Enc}(\hat{X}_t^i)$
and retrieves $\sum_{i=1}^{N} (\hat{X}_t^i + K'_i)$, which can be decrypted.

This approach has a severe problem: if the aggregator is untrusted, it can easily retrieve
the measurement of a node $v_i$. When the id of $v_i$ is broadcast as non-responding, the
participating nodes of $v_i$ reply in Step 2 with the dummy keys of $v_i$, which can then be
removed from $\mathrm{Enc}(\hat{X}_t^i)$.

(for modular reduction) on 32-bit integers, and at most 1000 PRF executions. This overhead should
be negligible even on constrained embedded devices.

7.2.2 Our proposal

In this approach, each node adds a secret random value to its encrypted value before
releasing it in the first round. This is needed to prevent the adversary from recovering the noisy
measurement by combining different messages of the nodes. Then, in the second round,
when the aggregator asks for the missing dummy keys, every node reveals its random key
along with the missing dummy keys that it knows:

1. Each node $v_i$ sends
$\mathrm{Enc}(\hat{X}_t^i) = \hat{X}_t^i + K'_i + \sum_{j=1}^{\ell} \mathrm{dkey}_{i,\mathrm{ind}_i[j]} + C_i \pmod{m}$,
where $C_i$ is the secret random key of $v_i$ generated randomly in each round.

2. After receiving all measurements, the aggregator asks all nodes for their random keys
and the missing dummy keys by broadcasting the ids of the non-responding nodes.

3. Each node $v_i$ verifies whether any ids in this broadcast message are in its participating
node list; the set of the corresponding participating nodes is denoted by $S$. Then,
$v_i$ replies with $\sum_{j \in S} \mathrm{dkey}_{i,\mathrm{ind}_i[j]} + C_i \pmod{m}$.

4. The aggregator subtracts all received values from $\sum_{i=1}^{N} \mathrm{Enc}(\hat{X}_t^i)$, which
results in $\sum_{i=1}^{N} (\hat{X}_t^i + K'_i)$, as the random keys as well as the dummy keys cancel out.

Note that as the supplier does not know the random keys, it cannot remove them from
any individual message but only from the final aggregate: once each node's response has been
removed from the aggregate, all the dummy keys and secret random keys cancel out and the supplier
obtains the noisy sum $\sum_{i=1}^{N} \hat{X}_t^i$ after decryption. Although
the supplier can still recover $\hat{X}_t^i$ if it knows $v_i$'s participating nodes (the supplier
simply asks for all the dummy keys of $v_i$ in Step 2 and subtracts $v_i$'s response in Step 4 from
$\mathrm{Enc}(\hat{X}_t^i)$), we will show later that this probability can be made practically small
by adjusting $w$ and $N$ correctly.

Note that the protocol fails if, for some reason, a node does not send its random key to
the aggregator (as only the node itself knows its random key, it cannot be reconstructed by
other parties). However, it is very unlikely that a node fails between the two rounds, and an
underlying reliable transport protocol helps to overcome communication errors.

Finally, also note that this random key approach always requires two rounds of communication
(even if the aggregator receives all encrypted values correctly in the first round), as
the random keys need to be removed from $\sum_{i=1}^{N} \mathrm{Enc}(\hat{X}_t^i)$ in the second round.
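
The two rounds can be sketched as follows, under simplifying assumptions that are not part of the scheme itself: every pair of nodes is treated as participating, all keys come from Python's non-cryptographic `random` module, and `two_round_aggregate` is a hypothetical helper name.

```python
import random

M_MOD = 2 ** 32  # modulus m (assumed)


def two_round_aggregate(noisy_values, failed, seed=0):
    """Return the decrypted sum over responding nodes; `failed` is a set of node indices."""
    rng = random.Random(seed)
    n = len(noisy_values)
    keystreams = [rng.randrange(M_MOD) for _ in range(n)]  # K'_i

    # Antisymmetric dummy keys between every pair of nodes (simplification).
    dkey = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            k = rng.randrange(M_MOD)
            dkey[i][j], dkey[j][i] = k, -k

    alive = [i for i in range(n) if i not in failed]
    c = {i: rng.randrange(M_MOD) for i in alive}  # per-round secret random keys C_i

    # Round 1: ciphertext blinded by keystream, dummy keys and random key.
    first = {
        i: (noisy_values[i] + keystreams[i] + sum(dkey[i]) + c[i]) % M_MOD
        for i in alive
    }

    # Round 2: dummy keys shared with non-responding nodes, plus the random key.
    second = {i: (sum(dkey[i][j] for j in failed) + c[i]) % M_MOD for i in alive}

    # Aggregator: subtract all round-2 replies, then the known keystreams.
    total = (sum(first.values()) - sum(second.values())) % M_MOD
    return (total - sum(keystreams[i] for i in alive)) % M_MOD


recovered = two_round_aggregate([10, 20, 30, 40, 50], failed={1, 3})
```

In this run `recovered` is the sum of the three responding nodes' values; a single round-2 reply does not expose the dummy keys it contains because they are masked by $C_i$.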



7.3 Utility evaluation 

If all $N$ nodes participate in the protocol, the added noise will be larger than the
$\mathcal{L}(\lambda)$ which is needed to ensure differential privacy. In particular,
$\sum_{i=1}^{N} [\mathcal{G}_1(N - M, \lambda) - \mathcal{G}_2(N - M, \lambda)] = \mathcal{L}(\lambda) + \sum_{i=N-M+1}^{N} [\mathcal{G}_1(N - M, \lambda) - \mathcal{G}_2(N - M, \lambda)]$,
where the last summand is the extra noise needed to tolerate the failure of at most $M$ nodes.
Clearly, this extra noise increases the error if all $N$ nodes operate correctly and add their
noise shares faithfully. In what follows, we calculate the error and its standard deviation if
we add this extra noise to the aggregate.

Theorem 2 (Utility) Let $\alpha = M/N$ and $\alpha < 1$. Then,

\[ \mu(t) \le \frac{2}{B(1/2, \frac{1}{1-\alpha})} \cdot \frac{\lambda(t)}{\sum_{i=1}^{N} X_t^i} \]

and

\[ \sigma(t) \le \sqrt{\frac{2}{1-\alpha} - \frac{4}{B(1/2, \frac{1}{1-\alpha})^2}} \cdot \frac{\lambda(t)}{\sum_{i=1}^{N} X_t^i} \]

where $B(x, y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$ is the beta function.

The derivation can be found in Appendix A. Based on Theorem 2,
$\sigma(t) = \mu(t) \cdot \sqrt{\frac{B(1/2, \frac{1}{1-\alpha})^2}{2(1-\alpha)} - 1}$.
It is easy to check that $\sigma(t)$ is always less than or equal to $\mu(t)$. In particular, if
$\alpha = 0$ (there are no malicious nodes and no node failures), then $\sigma(t) = \mu(t)$.
If $\alpha > 0$ then $\sigma(t) < \mu(t)$ but $\sigma(t) \approx \mu(t)$.

8 Security Analysis

8.1 Deploying malicious nodes

In the proposed scheme, each measurement is perturbed and encrypted. Therefore, an
honest-but-curious attacker cannot gain any information (up to $\varepsilon$) about individual
measurements in any slot. This is guaranteed by the encryption scheme and the added noise.

However, a DN adversary (see Section 3.2), which deploys $T$ malicious nodes, may be able to:

• reduce the noise level by limiting (or omitting) the gamma noise added by malicious
nodes. As a result, the sum of the noise shares will not equal the Laplacian noise,
which can decrease the privacy of users. However, recall that, due to the robustness
property of our scheme detailed in Section 7, we add extra noise to tolerate $M$ node
failures. Adding extra noise calibrated to $M + T$ is sufficient to tolerate this type of
attack.

• decrypt $\mathrm{Enc}(\hat{X}_t^i)$ of a node $v_i$ and retrieve the perturbed data. As individual
data is only weakly noised, the attacker might infer some information from it, and therefore
compromise privacy. However, the encryption scheme that we used is provably secure [4],
and nodes are assumed to be tamper-resistant. Thus, the only way to break privacy is
to retrieve the dummy keys of $v_i$. Because the participating nodes are selected randomly
for each message, this can only be achieved if all participating nodes of $v_i$ are malicious
and the supplier is also malicious (i.e., the adversary knows $K'_i$). This happens if $v_i$
does not select any honest participating node, which has a probability of
$(1 - \frac{w}{N-1})^{N-T-1}$. For instance, it is easy to check that if $N = 100$ and 50% of the
nodes are malicious (which is anyway quite a strong assumption), then setting $w$ to 30 results
in a success probability of $1.8 \cdot 10^{-8}$. This means that if an epoch is 5 min long, then the
adversary will compromise 1 measurement during 458 years on average.

Finally, also note that this is the success probability of the adversary in a single slot.
This means that a supplier that succeeds in the previous attack only gets a single (noisy)
measurement of the customer (corresponding to a single epoch). As a node selects
different participating nodes in each slot, the probability that the adversary gets $k$
different measurements of the node is $(1 - \frac{w}{N-1})^{k(N-T-1)}$, which is even smaller.

8.2 Lying supplier

Lying about non-responding nodes

In addition to deploying malicious (fake) nodes, a malicious supplier can lie about the
non-responding nodes. In order to recover $\hat{X}_t^i$, the supplier needs
$\sum_{j=1}^{\ell} \mathrm{dkey}_{i,\mathrm{ind}_i[j]} + C_i$. The
supplier has two options to retrieve this sum. First, it might pretend that a node $v_i$ did not
respond in the first round, and ask its participating nodes for $v_i$'s dummy keys. At the
same time, the supplier claims to $v_i$ that its participating nodes are responding. Hence, as
described in Section 7.2.2, the participating nodes of $v_i$ will disclose $v_i$'s dummy keys and
$v_i$ will disclose $C_i$. However, the random keys of $v_i$'s participating nodes prevent the
supplier from retrieving $v_i$'s dummy keys from their messages.

Second, the supplier can pretend that $v_i$'s participating nodes do not respond in the first
round, and ask $v_i$ for their dummy keys in the second round. In particular, there are three
types of dummy keys: the first is shared with a malicious node, and hence, known to the
supplier. The second is requested from $v_i$ by the supplier in the second round (the supplier
pretends that these nodes are non-responding), and $v_i$ replies with the sum of $C_i$ and the
requested keys. Finally, the rest are shared with honest participating nodes and are not requested
from $v_i$ in the second round. Clearly, if $v_i$ has at least one dummy key from the last group, its
measurement cannot be recovered. This is because if $v_j$ is an honest participating node of $v_i$
and $\mathrm{dkey}_{i,j}$ is not requested from $v_i$ in the second round, it could only be recovered
from $v_j$'s messages. However, $v_j$ sends $C_j + \mathrm{dkey}_{j,i}$, where $C_j$ is only known to $v_j$.

Nevertheless, it might happen that $v_i$ does not have any third-type dummy key (i.e., the
supplier asks $v_i$ for all the dummy keys shared with honest nodes). Then, the supplier can
easily recover $v_i$'s measurement, since it knows
$\sum_{j=1}^{\ell} \mathrm{dkey}_{i,\mathrm{ind}_i[j]} + C_i$ (the keys are either shared with malicious
nodes or provided by $v_i$). However, the supplier can only guess $v_i$'s participating nodes and
target them randomly, since $v_i$ also selects them randomly (footnote 9). Assuming that the supplier
can ask $v_i$ for at most $M$ dummy keys in the second round, the probability that all participating
nodes of $v_i$ are either malicious or specified as non-responding nodes by the supplier is less than
$(1 - \frac{w}{N-1})^{N-(T+M)-1}$. Using $\alpha = (T + M)/N$ and $\beta = w/N$, we have
$(1 - \frac{w}{N-1})^{N-(T+M)-1} = (1 - \frac{\beta N}{N-1})^{N(1-\alpha)-1}$.
This probability is depicted in Figure 3 depending on $\alpha$, $\beta$ and $N$.
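
The bound is easy to evaluate numerically. The sketch below uses a hypothetical helper, `attack_success_probability`, and plugs in the parameters of the example in Section 8.1 ($N = 100$, $T = 50$, $w = 30$, $M = 0$) with 5-minute epochs.

```python
def attack_success_probability(n: int, w: int, t: int, m: int = 0) -> float:
    """Bound (1 - w/(N-1))^(N-(T+M)-1) on the supplier's per-slot success probability.

    n: cluster size N, w: expected number of participating nodes,
    t: number of malicious nodes T, m: number of nodes the supplier may declare non-responding.
    """
    return (1.0 - w / (n - 1)) ** (n - (t + m) - 1)


EPOCHS_PER_YEAR = 365 * 24 * 12  # number of 5-minute epochs in a year

p = attack_success_probability(n=100, w=30, t=50)
years_per_compromise = 1.0 / p / EPOCHS_PER_YEAR

# Letting the supplier also declare M = 10 nodes as non-responding weakens the bound.
p_with_m = attack_success_probability(n=100, w=30, t=50, m=10)
```

For these parameters `p` is on the order of $10^{-8}$ and `years_per_compromise` comes out at roughly 458 years, in line with the estimate of Section 8.1; `p_with_m` is larger, which is why $w$ and $N$ have to be chosen with $T + M$ in mind.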




Figure 3: Success probability of guessing participating nodes depending on $\beta$ and different
values of $\alpha$ and $N$. Panels: (a) $N = 100$, (b) $N = 300$.



9 Note that all nodes send responses in the second round, and the randomness of $C_i$ ensures that
the supplier cannot gain any knowledge about the participating nodes of any node.

Lying about cluster size

Another strategy for the supplier to compromise the privacy of users is to lie about the 
cluster size. If the supplier pretends that the cluster size N' is larger than it really is (i.e., 
N' > N), the noise added by each node will be underestimated. In fact, each node will 
calibrate its gamma noise using N' instead of N. As a result, the aggregated noise at the 
supplier will be smaller than necessary to guarantee sufficient differential privacy. 

In order to prevent this attack, a solution would be to set the cluster size to a fixed value. 
For example, all clusters should have a size of 100. Although simple and efficient, this solution 
is not flexible and might not be applicable to all scenarios. Another option is that the supplier 
publishes, together with the list of cluster nodes, a self-signed certificate of each node of the 
cluster (containing a timestamp, the cluster id and the node information). That way, each 
node could verify the cluster size and get information about other member nodes. 

9 Simulation results 

9.1 A high-resolution electricity trace simulator 

Due to the lack of high-resolution real world data, we implemented a domestic electricity
demand model [21] that can generate one-minute resolution synthetic consumption data of
different households (footnote 10). It is an extended version of the simulator developed in [21]. The
simulator includes 33 different appliances and implements a separate lighting model which 
takes into account the level of natural daylight depending on the month of the year. The 
number of residents in each household is randomly selected between 1 and 5. A trace is 
associated to a household and generated as follows: (1) A number of active persons is selected 
according to some distribution derived from real statistics. This number may vary as some 
members can enter or leave the house. (2) A set of appliances is then selected and activated 
at different time of the day according to another distribution, which was also derived from 
real statistics. 

The input of the simulator is the number of households, the month of the year, and the
type of the day (either a working or weekend day). The output is the power demand model
(1-min profile) of all appliances in each household on the given day. Using this simulator, we
generated 3000 electricity traces corresponding to different households on a working day in
November, where the number of residents in each household was randomly selected between
1 and 5. Each trace was then sanitized according to our scheme. The noise added in each
slot (i.e., $\lambda(t)$) is set to the maximum consumption in the slot (i.e.,
$\lambda(t) = \max_{1 \le i \le N} X_t^i$, where the maximum is taken over all users in the cluster).
This amount of noise ensures $\varepsilon = 1$ indistinguishability for individual measurements in
all slots. Although one can increase $\lambda(t)$ to get better privacy, the error will also
increase. Note that, if the error $\mu_{\varepsilon}(t)$ is given, the error for another value
$\varepsilon' \ne \varepsilon$ is $\mu_{\varepsilon'}(t) = \frac{\varepsilon}{\varepsilon'} \cdot \mu_{\varepsilon}(t)$.
We assume that $\lambda(t) = \max_i X_t^i$ is known a priori.
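
As a small illustration of this calibration, the sketch below computes $\lambda(t)$ for one slot and rescales an error value to a different privacy level. The readings are made-up numbers and the function names are hypothetical; the relative error $\lambda(t)/\sum_i X_t^i$ is the expression the paper uses for the case without extra robustness noise.

```python
def noise_scale(cluster_slot_readings):
    """lambda(t) = max_i X_t^i, the setting used for epsilon = 1 in the simulation."""
    return max(cluster_slot_readings)


def rescale_error(mu_eps: float, eps: float, eps_new: float) -> float:
    """mu_{eps'}(t) = (eps / eps') * mu_eps(t): halving epsilon doubles the error."""
    return eps / eps_new * mu_eps


# Hypothetical readings (in watts) of a four-household cluster in one slot.
readings = [120.0, 800.0, 60.0, 2000.0]
lam = noise_scale(readings)

# Relative error lambda(t) / sum_i X_t^i for this slot.
mu = lam / sum(readings)

# Error at a stricter privacy level epsilon' = 0.5.
mu_stricter = rescale_error(mu, eps=1.0, eps_new=0.5)
```

A four-household cluster gives a very large relative error, which is consistent with the observation in Section 9.2 that the error shrinks as $N$ grows.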

9.2 Error according to the cluster size 

The error introduced by our scheme depends on the cluster size N. In this section, we 
present how the error varies according to N. 

10 Available at http://www.crysys.hu/~acs/misc/

9.2.1 Random clustering

The most straightforward scheme to build $N$-sized clusters is to select $N$ users uniformly
at random. The advantage of this approach is that users only need to send the noisy aggregate
to the supplier. Figure 4(a) and 4(b) show the average error value and its standard deviation,
resp., depending on the size of the cluster. The average error of a given cluster size $N$ is the
average of $\mathrm{mean}_t(\mu(t))$ over all $N$-sized clusters (footnote 11). Obviously, higher $N$
causes smaller error. Furthermore, a high $\alpha$ results in larger noise added by each meter, as
described in Section 7.3, which also implies larger error. Interestingly, increasing the sampling
period (i.e., $T_p$) results in a slight error decrease (footnote 12); hence, we only considered a
10 min sampling period. Unless noted otherwise, we assume a 10 min sampling period in the sequel.

Figure 4: The error depending on $N$ using random clustering. $T_p$ is 10 min. Panels: (a) average
error, (b) standard deviation, (c) maximum error.



9.2.2 Consumption based clustering 

As X(t) is set to the maximum consumption at t inside a cluster, we could get lower error 
if the maximum consumption is close to the mean of the measurements within a cluster in 
every t. Hence, instead of randomly clustering users, a more clever approach is to cluster 
them based on the "similarity" of their consumption profiles. Intuitively, the measurements 
in similar profiles are close, and thus, the difference between the maximum consumption and 
the average should also be smaller than in a random cluster. 

We measure profile similarity by the average daily consumption: the $N$-sized clusters
are created by calculating daily consumption levels $L_1, L_2, \ldots, L_n$ (where $L_i < L_{i+1}$
for all $1 \le i \le n - 1$) such that the number of users whose daily average is between $L_i$ and
$L_{i+1}$ is exactly $N$ for all $i$. Then, all users being in the same level form a cluster. In
contrast to random clustering, users need to provide the supplier with their daily averages, which
may leak some private information. However, this can also be derived from the (monthly) aggregate
consumption of each user, which is generally revealed for the purpose of billing.

11 In fact, the average error is approximated in Figure 4(a): we picked 200 different clusters for each $N$,
and plotted the average of their $\mathrm{mean}_t(\mu(t))$. 200 is chosen according to experimental
analysis. Above 200, the average error does not change significantly.

12 This decrease is less than 0.01 even if $N$ is small when the sampling period is changed from 5 min to 15
min.
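
One simple way to obtain such levels is to rank the users by daily average and cut the ranking into consecutive groups of $N$. The sketch below assumes the supplier holds a mapping from user id to daily average; the function name and the sample data are hypothetical, and the last cluster is smaller when the number of users is not a multiple of $N$.

```python
def consumption_based_clusters(daily_averages: dict, n: int) -> list:
    """Group users with similar daily averages into consecutive clusters of size n."""
    # Rank user ids by their daily average consumption, lowest first.
    ranked = sorted(daily_averages, key=daily_averages.get)
    # Each slice of n consecutive users corresponds to one consumption level.
    return [ranked[k:k + n] for k in range(0, len(ranked), n)]


# Hypothetical daily averages (kWh) of six users.
averages = {"a": 5.0, "b": 1.0, "c": 3.0, "d": 9.0, "e": 2.0, "f": 7.0}
clusters = consumption_based_clusters(averages, n=3)
```

The low consumers ("b", "e", "c") end up in one cluster and the high consumers in the other, so the per-slot maximum in each cluster stays closer to the cluster mean.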

Figure 5(a) and 5(b) show the average error and its deviation, resp., calculated identically
to random clustering. Comparing Figures 5 and 4, consumption based clustering has lower
error than the random one. The improvement varies up to 5% depending on $N$. For instance,
while random clustering provides an average error of 0.13 with $N = 100$ users in a cluster,
consumption based clustering has 0.07. The difference decreases as $N$ increases. There are
more significant differences between the standard deviations and the worst cases: at lower
values of $N$, the standard deviation of the average error in random clustering is almost twice
as large as in consumption based clustering (Figure 5(b) and 4(b)). To compute the worst
case error at a given $N$, the maximum error is computed in all slots, which is the highest
cluster error that can occur in a slot with cluster size $N$. Then, the average of these maximum
errors (the average is taken over all slots) is plotted in Figure 4(c) and 5(c). Clearly, the
worst case error in random clustering is much higher than in consumption based clustering,
as random clustering may put high and low consuming users into the same cluster.

Figure 5: The error depending on $N$ using consumption based clustering. $T_p$ is 10 min.



9.3 Privacy over multiple slots 

So far, we have considered the privacy of individual slots, i.e. added noise to guarantee
$\varepsilon = 1$ privacy in each slot of size 10 minutes. However, a trace is composed of several
slots. For instance, if a user watches TV during multiple slots, we have guaranteed that an
adversary cannot tell if the TV is watched in any particular slot (up to $\varepsilon = 1$).
However, by analysing $s$ consecutive slots corresponding to a given period, it may be able to tell
whether the TV was watched during that period (the privacy bound of this is
$\varepsilon_s = \varepsilon \cdot s$ due to the composition property of differential privacy).
Based on Theorem 1, we need to add noise $\lambda(t) = \sum_{t=1}^{s} \max_i X_t^i$ to each
aggregate to guarantee an $\varepsilon_s = 1$ bound in $s$ consecutive slots, which, of course,
results in higher error than in the case of $s = 1$ that we have assumed so far. Obviously, using
the LPA technique, we cannot guarantee reasonably low error if $s$ increases, as the necessary
noise $\lambda(t) = \sum_{t=1}^{s} \max_i X_t^i$ can be large. In order to keep the error
$\lambda(t) / \sum_{i=1}^{N} X_t^i$ low while ensuring better privacy than
$\varepsilon_s = s \cdot \varepsilon$, one can increase the number of users inside each cluster
(i.e., $N$).

Figure 6(a) shows the average privacy of a user, in our dataset, as a function of the
cluster size and the value of $s$. As the cluster size increases, the privacy bound decreases
(i.e. privacy increases). The reason is that when the cluster size increases, the maximum
consumption also increases with high probability. Since the noise is calibrated according to the
maximum consumption within the cluster, it will be larger. This results in better privacy.

Figure 6: Privacy of appliances in $s$ long time windows (where $s$ is 10 min, 15 min, 30 min,
1 h, 4 h, 8 h, 1 day). Panels: (a) all appliances, (b) active appliances.



9.3.1 Privacy of appliances 

In the previous section, we analysed how a user's privacy varies over time. In this section, 
we consider the privacy of the different appliances. For example, we aim at answering the 
following question: what was the user's privacy when he was watching TV last evening between 
18:00 and 20:00? More specifically, we consider two privacy threats: 

• Presence of appliances: Can the adversary tell that the user watched TV yesterday? In
order to compute the corresponding privacy (i.e. $\varepsilon_s$), we compute
$\sum_{t=108}^{\ldots} \varepsilon(t)$, where
$\varepsilon(t) = \{\text{TV's consumption in } t\} / \lambda(t)$.

• Activation time of appliances: If the adversary knows that the user watched TV, can
he tell at what time the user did it? We use statistical inference to detect the position of an
appliance signature in the noisy trace.
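
The presence bound of the first item is a plain sum of per-slot ratios. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def appliance_privacy(appliance_profile, noise_scales):
    """eps_s = sum over slots of eps(t), with eps(t) = appliance consumption in t / lambda(t)."""
    return sum(c / lam for c, lam in zip(appliance_profile, noise_scales))


# Hypothetical example: an appliance drawing 100 W in two of three slots,
# with per-slot noise scales lambda(t) of 1000, 2000 and 1500.
eps_s = appliance_privacy([100.0, 100.0, 0.0], [1000.0, 2000.0, 1500.0])
```

Slots in which the appliance is off contribute nothing, so the bound grows with the time the appliance is in use and shrinks as the cluster's noise scale grows.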



Presence of appliance: We summarized the privacy of some appliances in Table 1 in Appendix B.
Each value is computed by averaging the privacy provided in our 3000 traces. The
appliances can be divided into two major groups: the usage of active appliances indicates that
the user is at home and uses the appliance (their consumption significantly changes during
their active usage, such as iron, vacuum, kettle, etc.), whereas passive appliances (like fridges,
freezers, storage heaters, etc.) have more or less identical consumption regardless of whether the
user is at home or not. In general, appliances having lower consumption threaten privacy less than
devices with higher energy demands. Obviously, $\varepsilon_s$ increases when $s$ increases since an
appliance is used more frequently within longer periods.

Finally, we want to measure the privacy of active appliances. This is equivalent to answering
the question whether the user was at home in any $s$ long period. The average privacy is depicted
in Figure 6(b). Observe that there are no considerable differences between Figures 6(b) and
6(a), as a profile is primarily shaped by active appliances (because they typically consume
much more than passive appliances).

Activation time of appliances: Consider the consumption profile $V = (V_1, V_2, \ldots, V_n)$ of
a given appliance of a user (on a single day), where the appliance is switched on at $t_s$ first and
switched off at $t_s + d$ last (i.e., $V_i = 0$ for $1 \le i < t_s$ and $t_s + d < i \le n$). The
signature of the appliance $\mathrm{Sig}(V) = (V_{t_s}, V_{t_s+1}, \ldots, V_{t_s+d})$ is the
consumption profile of the appliance between $t_s$ and $t_s + d$. The adversary is provided with the
noisy consumption profile of the appliance (i.e., $\hat{V}$) and, in addition, knows the signature
of the appliance, but it does not know $t_s$ (i.e., it knows that the appliance was used with the
given signature but does not know when).

The goal of the adversary is to infer the starting slot $t_s$ in $V$ using $\hat{V}$. If the
adversary's guess is $t'$, the inference accuracy is measured by $|t' - t_s|$. We consider the
following adversaries:

• RG-Adv: This is the simple random guesser and serves as a baseline. If there are $n - d$
possible values of $t_s$, then the guess $t'$ is selected out of them uniformly at random.

• ST-Adv: This adversary knows the relative frequency of each slot occurring as a starting
slot (denoted by $f_i$ at slot $i$), and guesses the most likely starting slot:
$t' = \arg\max_i f_i$ ($1 \le i \le n - d$). This information is publicly available from several
surveys [21].

• Bayesian-Adv: This adversary performs Bayesian inference on $t_s$. In particular, let $V^t$
denote a profile where the signature starts at slot $t$ (i.e., $V^t$ is obtained by shifting $V$
by $|t - t_s|$ positions to the left/right if $t - t_s$ is negative/positive (footnote 13)).
Assuming that the adversary has no prior knowledge about the distribution of starting slots (i.e.,
they are distributed uniformly at random), the posterior distribution is computed as

\[ P(T = i) = \frac{\prod_{k=1}^{n} P(V_k^i + \mathcal{L}(\lambda_k) = \hat{V}_k)}{\sum_{j=1}^{n-d} \prod_{k=1}^{n} P(V_k^j + \mathcal{L}(\lambda_k) = \hat{V}_k)} \]

where $T$ describes the posterior distribution of starting slots. As the Bayes risk is
"linear" in our case (i.e., $|t' - t_s|$), the Bayes estimate (i.e., $t'$) is the posterior median
(i.e., $t'$ satisfies $P(T \le t') \ge 0.5$ and $P(T > t') \le 0.5$).

• Bayesian-ST-Adv: We expect better results if the Bayesian adversary uses the relative
frequencies as prior knowledge. In particular, the adversary knows the probability distribution
of starting slots a priori, denoted by $\theta = \{f_1, f_2, \ldots, f_{n-d}\}$, which is described
by the relative frequencies:

\[ P(T = i \mid \theta) = \frac{f_i \cdot \prod_{k=1}^{n} P(V_k^i + \mathcal{L}(\lambda_k) = \hat{V}_k)}{\sum_{j=1}^{n-d} f_j \cdot \prod_{k=1}^{n} P(V_k^j + \mathcal{L}(\lambda_k) = \hat{V}_k)} \]

As before, the Bayes estimate is the posterior median.

13 More formally, $V_i^t = 0$ for all $1 \le i < t$ and $t + d < i \le n$, and
$V_i^t = \mathrm{Sig}(V)_{i-t+1}$ for $t \le i \le t + d$.

The inference accuracy of each adversary is shown in Table 2 in Appendix B. The inference
is performed on our dataset within a single day. Bayesian-ST-Adv outperforms all other adversaries,
especially for active devices; however, its accuracy never falls below 1.7 hours. Regarding the
passive appliances, ST-Adv outperforms Bayesian-ST-Adv in general. This is explained by the
fact that passive appliances usually follow a regular operation cycle with less user intervention
in all households, and the accuracy of ST-Adv is always within the length of one operation
cycle independently of the added noise (footnote 14).
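
The Bayesian adversaries described above can be sketched as follows. This is an illustrative reading of the inference, not the authors' code: the likelihood of a candidate start is taken to be the product of Laplace densities of the per-slot residuals, indices are 0-based, the signature is passed as a list whose length plays the role of $d + 1$, and all function names are hypothetical. Passing `prior=None` gives Bayesian-Adv; passing the relative frequencies gives Bayesian-ST-Adv.

```python
import math


def shifted_profile(signature, start, n):
    """Profile of length n that is zero except for the signature starting at slot `start`."""
    profile = [0.0] * n
    profile[start:start + len(signature)] = signature
    return profile


def log_laplace_pdf(x, lam):
    """Log-density of a zero-mean Laplace distribution with scale lam."""
    return -math.log(2.0 * lam) - abs(x) / lam


def posterior(noisy, signature, lams, prior=None):
    """Posterior over starting slots given the noisy profile and per-slot noise scales."""
    n, d = len(noisy), len(signature)
    starts = range(n - d + 1)
    if prior is None:
        prior = [1.0] * len(starts)  # uniform prior
    log_post = []
    for s in starts:
        cand = shifted_profile(signature, s, n)
        ll = sum(log_laplace_pdf(noisy[k] - cand[k], lams[k]) for k in range(n))
        log_prior = math.log(prior[s]) if prior[s] > 0 else float("-inf")
        log_post.append(log_prior + ll)
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_post]
    z = sum(weights)
    return [w / z for w in weights]


def posterior_median(post):
    """Smallest start whose cumulative posterior mass reaches 0.5."""
    acc = 0.0
    for s, p in enumerate(post):
        acc += p
        if acc >= 0.5:
            return s
    return len(post) - 1


signature = [100.0, 120.0, 90.0]
observed = shifted_profile(signature, 4, 10)  # noise-free observation, for illustration
post = posterior(observed, signature, lams=[1.0] * 10)
guess = posterior_median(post)
```

With a noise scale that is small compared to the signature, the posterior concentrates on the true start; as the scale grows towards the cluster maximum used in the paper, the posterior flattens and the median drifts away from $t_s$.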

10 Conclusion 

Our measurements show two different, and conflicting, results. Figure 6(a) shows that it
may actually be difficult to hide the presence of activities in a household. In fact, the computed
$\varepsilon$ values are quite high, even for large clusters. However, the results presented in
Tables 1 and 2 are more encouraging. They show that, although it might be difficult to hide a
user's presence, it is still possible to hide his actual activity. In fact, appliance privacy
bounds ($\varepsilon$ values) are quite small, which indicates that an adversary will have
difficulty telling whether the user is, for example, using his computer or watching TV during a
given period of time. Furthermore, the results in Table 2 show that it is even more difficult for
an adversary to tell when a given activity actually started. Finally, we recall that in order to
keep the error $\lambda(t) / \sum_{i=1}^{N} X_t^i$ low while ensuring better privacy, one can always
increase the number of users inside each cluster. For instance, doubling $N$ from 100 to 200 allows
doubling the noise while keeping approximately the same error value (0.118 in Figure 5(a) if
$\alpha = 0$). This results in much better privacy, since, on average, doubling the noise halves the
privacy parameter $\varepsilon_s$.

Although more work and research is needed, we believe this is an encouraging result for
privacy.

A Proof of Theorem 2 (Utility)

Lemma 2 (Integral property of the Bessel function [15]) Let 

Kt(x) = l(%Y I" t-*- 1 exp ( -t - ^] dt, x>0 



2 v 2 y Jo r \ 4 V 

define the modified Bessel function of the third kind with index •& G R. For any 7 > and 
7, v such that 7 + 1 ± v > 

f^ ( ^^ r (i±2±.) r (i±p:) 

Lemma 3 Let Gi,G 2 be i.i.d gamma random variables with parameters (n,A). Then, 

2A 



E\Gi(n,\)-g 2 (n,\)\ 



and 



Var\g l (n,X)-g 2 (n,X)\ = - - 2 A (2) 

V B [ 2 , n ) / 

where B(x,y) is the beta function defined as B(x,y) = -w^j^r • 
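Proof (of Lemma 3) Consider $\mathcal{Y} = \mathcal{G}_1 - \mathcal{G}_2$. The characteristic function of $\mathcal{Y}$ is 

$$\varphi_{\mathcal{Y}}(t) = \left(\frac{1}{1 + i\lambda t}\right)^{1/n} \left(\frac{1}{1 - i\lambda t}\right)^{1/n} = \left(\frac{1}{1 + \lambda^2 t^2}\right)^{1/n}$$

which is a special case of the characteristic function of the Generalized Asymmetric Laplace 
distribution (GAL) with parameters $(\theta, \kappa, \omega, \tau)$: 

$$\varphi_{GAL}(t) = e^{i\theta t} \left(\frac{1}{1 + i\frac{\sqrt{2}}{2}\omega\kappa t}\right)^{\tau} \left(\frac{1}{1 - i\frac{\sqrt{2}}{2}\frac{\omega}{\kappa} t}\right)^{\tau}$$

where $\theta = 0$, $\kappa = 1$, $\omega = \sqrt{2}\lambda$, and $\tau = 1/n$. The density function of $GAL(\theta, \kappa, \omega, \tau)$ when 
$\theta = 0$ and $\kappa = 1$ is 

$$f_{GAL}(x) = \frac{\sqrt{2}}{\omega^{\tau+1/2}\,\Gamma(\tau)\sqrt{\pi}} \left(\frac{|x|}{\sqrt{2}}\right)^{\tau-1/2} K_{\tau-1/2}\left(\frac{\sqrt{2}\,|x|}{\omega}\right)$$

where $K_{\tau-1/2}(\cdot)$ is the Bessel function defined in Lemma 2. In addition, 

$$E|\mathcal{Y}| = \int_{-\infty}^{\infty} |x|\, f_{GAL}(x)\,dx = 2\int_0^{\infty} x\, \frac{\sqrt{2}}{\omega^{\tau+1/2}\,\Gamma(\tau)\sqrt{\pi}} \left(\frac{x}{\sqrt{2}}\right)^{\tau-1/2} K_{\tau-1/2}\left(\frac{\sqrt{2}\,x}{\omega}\right) dx$$

which follows from the symmetry property of $f_{GAL}(x)$ ($\varphi_{\mathcal{Y}}(t)$ is real valued). After 
reformulation, we have 

$$E|\mathcal{Y}| = 2\left(\frac{1}{\sqrt{2}}\right)^{\tau-1/2} \frac{\sqrt{2}}{\omega^{\tau+1/2}\,\Gamma(\tau)\sqrt{\pi}} \int_0^{\infty} x^{\tau+1/2}\, K_{\tau-1/2}\left(\frac{\sqrt{2}\,x}{\omega}\right) dx$$

Now, we can apply Lemma 2 to the integral (with $\gamma = \tau + 1/2$, $\nu = \tau - 1/2$, and $a = \sqrt{2}/\omega$) 
and we obtain 

$$E|\mathcal{Y}| = \sqrt{2}\,\omega\, \frac{\Gamma\left(\tau + \frac{1}{2}\right)}{\Gamma(\tau)\sqrt{\pi}}$$

after simple derivation. Using that $\sqrt{\pi} = \Gamma(1/2)$ and $B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$, we have 

$$E|\mathcal{Y}| = \frac{\sqrt{2}\,\omega}{B\left(\frac{1}{2}, \tau\right)}$$

Applying $\omega = \sqrt{2}\lambda$ and $\tau = 1/n$, we arrive at Equation (1). 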
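To prove Equation (2), consider that 

$$Var(|\mathcal{Y}|) = E|\mathcal{Y}|^2 - \left[E|\mathcal{Y}|\right]^2$$

where 

$$E|\mathcal{Y}|^2 = E(\mathcal{Y}^2) = E(\mathcal{G}_1^2) + E(\mathcal{G}_2^2) - 2 \cdot E(\mathcal{G}_1) \cdot E(\mathcal{G}_2)$$

Using that $E(\mathcal{G}_1^2) = E(\mathcal{G}_2^2) = (1/n^2 + 1/n)\lambda^2$ and $E(\mathcal{G}_1) = E(\mathcal{G}_2) = \lambda/n$, so that 
$E|\mathcal{Y}|^2 = 2\lambda^2/n$, we obtain Equation (2). 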
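
Lemma 3 can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original proof. It assumes that a gamma variable with parameters $(n, \lambda)$ has shape $1/n$ and scale $\lambda$, which is the reading implied by $\tau = 1/n$ and by the second moment $(1/n^2 + 1/n)\lambda^2$ used above. The function names are hypothetical, and only the Python standard library is used.

```python
import math
import random


def beta(x: float, y: float) -> float:
    """Beta function B(x, y) = Gamma(x) * Gamma(y) / Gamma(x + y)."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)


def lemma3_mean(n: float, lam: float) -> float:
    """Equation (1): E|G1 - G2| = 2 * lam / B(1/2, 1/n)."""
    return 2.0 * lam / beta(0.5, 1.0 / n)


def lemma3_variance(n: float, lam: float) -> float:
    """Equation (2): Var|G1 - G2| = 2 * lam^2 / n - (E|G1 - G2|)^2."""
    return 2.0 * lam ** 2 / n - lemma3_mean(n, lam) ** 2


def simulate(n: float, lam: float, trials: int, seed: int = 0):
    """Empirical mean and variance of |G1 - G2| (assumed shape 1/n, scale lam)."""
    rng = random.Random(seed)
    shape = 1.0 / n
    samples = [abs(rng.gammavariate(shape, lam) - rng.gammavariate(shape, lam))
               for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
    return mean, var


if __name__ == "__main__":
    n, lam = 4, 2.0
    print("closed form:", lemma3_mean(n, lam), lemma3_variance(n, lam))
    print("simulation: ", simulate(n, lam, trials=100_000))
```

For $n = 1$ the difference of the two gamma variables is Laplace distributed with scale $\lambda$, and both closed forms reduce to the familiar values $E|\mathcal{Y}| = \lambda$ and $Var|\mathcal{Y}| = \lambda^2$.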

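Now, we can easily prove Theorem 2. 
Proof (of Theorem 2) 

$$E\left|\sum_{i=1}^{N} (X_t^i - \hat{X}_t^i)\right| = E\left|\sum_{i=1}^{N} \mathcal{G}_1(N-M, \lambda) - \sum_{i=1}^{N} \mathcal{G}_2(N-M, \lambda)\right|$$

(using that $\sum_{i=1}^{n} \mathcal{G}(k_i, \lambda) = \mathcal{G}\left(1/\sum_{i=1}^{n} \frac{1}{k_i}, \lambda\right)$) 

$$= E\left|\mathcal{G}_1(1 - M/N, \lambda) - \mathcal{G}_2(1 - M/N, \lambda)\right|$$

(using $\alpha = M/N$ and applying Lemma 3) 

$$= \frac{2}{B\left(\frac{1}{2}, \frac{1}{1-\alpha}\right)}\, \lambda$$

The standard deviation $\sqrt{Var\left|\sum_{i=1}^{N} (X_t^i - \hat{X}_t^i)\right|}$ can be derived identically. 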
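
The bound of Theorem 2 is easy to evaluate and to compare against a simulation. The sketch below is illustrative only. It assumes, as in the proof above, that each of the N users adds the difference of two gamma variables calibrated for N - M contributors (read here as shape 1/(N - M) and scale $\lambda$). The function names and the chosen values of N, M and $\lambda$ are hypothetical.

```python
import math
import random


def expected_error(lam: float, alpha: float) -> float:
    """Closed form 2 * lam / B(1/2, 1/(1 - alpha)) for the expected absolute error."""
    y = 1.0 / (1.0 - alpha)
    beta = math.gamma(0.5) * math.gamma(y) / math.gamma(0.5 + y)
    return 2.0 * lam / beta


def simulated_error(n_users: int, n_malicious: int, lam: float,
                    trials: int, seed: int = 0) -> float:
    """Average |aggregate noise| when every user adds G1 - G2 with
    assumed shape 1/(n_users - n_malicious) and scale lam."""
    rng = random.Random(seed)
    shape = 1.0 / (n_users - n_malicious)
    total = 0.0
    for _ in range(trials):
        noise = sum(rng.gammavariate(shape, lam) - rng.gammavariate(shape, lam)
                    for _ in range(n_users))
        total += abs(noise)
    return total / trials


if __name__ == "__main__":
    for alpha in (0.0, 0.25, 0.5):
        print(alpha, expected_error(1.0, alpha))
    # Hypothetical small cluster: N = 10 users, M = 5, so alpha = 0.5.
    print(simulated_error(10, 5, 1.0, trials=20_000))
```

With $\alpha = 0$ the closed form equals $\lambda$, the mean absolute value of Laplace noise with scale $\lambda$. With $\alpha = 0.5$ it equals $1.5\lambda$, since $B(1/2, 2) = 4/3$.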



B Privacy of some ordinary appliances 

Column headers could not be recovered; each row lists five groups of three values, and "?" 
marks a value that could not be recovered. 

Appliance 
Lighting                    ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ?
Cassette / CD Player     0.02  0.04   0.79 |  0.04  0.04   0.81 |  0.05  0.05   0.82 |  0.07  0.05      ? |  0.09  0.07   0.96
Hi-Fi                    0.10  0.17   4.43 |  0.16  0.19   4.59 |  0.17     ?   4.62 |  0.18  0.21   4.62 |  0.19  0.21   4.62
Iron                     0.75  1.81  42.91 |  0.82  1.82  42.99 |  0.92  1.83  42.99 |  1.00  1.86  42.99 |  1.02  1.89  42.99
Vacuum                   1.67  7.59 134.54 |  1.70  7.59 134.54 |  1.82  7.58      ? |  1.90  7.60 134.54 |  1.94  7.63 134.54
Fax                      0.04  0.10   1.55 |  0.04  0.10   1.55 |  0.04  0.10   1.55 |  0.05  0.10   1.56 |  0.05  0.10   1.56
Personal computer        0.21  0.32   7.48 |  0.34  0.36   7.48 |  0.83  0.49   7.48 |  1.09  0.58   7.53 |  1.42  0.83   8.37
Printer                  0.07  0.30   7.78 |  0.08  0.31   7.78 |  0.09  0.31   7.78 |  0.10  0.31   7.78 |  0.11  0.31   7.83
TV                       0.15  0.47   7.41 |  0.22  0.48   7.45 |  0.37  0.52   7.45 |  0.45  0.58   8.37 |  0.50  0.63   8.37
VCR / DVD                0.05  0.16   2.81 |  0.07  0.17   2.84 |  0.10  0.17   2.89 |  0.13  0.18   2.95 |  0.14  0.19   3.01
TV Receiver box          0.03  0.11   2.12 |  0.05  0.11   2.21 |  0.08  0.12   2.32 |  0.10  0.13      ? |     ?     ?      ?
Hob                         ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ?
Oven                        ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ?
Microwave                   ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ?
Kettle                   0.55  2.71  63.59 |  0.59  2.71  63.59 |  0.72  2.73  63.87 |  0.83  2.76  64.22 |  1.02  2.79  64.22
Small cooking (group)       ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ?
Dish washer                 ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ?
Tumble dryer                ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ? |     ?     ?      ?
Washing machine          1.23  1.43  31.57 |  1.30  1.45  31.72 |  1.96  1.63  33.24 |  2.55  1.76  33.24 |  3.07  2.07  34.62
Washer dryer             1.82  1.08  19.22 |  3.17  1.33  19.27 |  4.70  1.99  25.82 |  6.39  2.38  25.82 |  7.92  3.49  33.66
E-INST                   1.47  1.12   6.54 |  1.93  1.15   6.54 |  3.47  1.16   7.58 |  4.70  1.49   9.00 |  7.06  2.13  10.99
Electric shower          2.13 14.78 249.24 |  2.16 14.78 249.24 |  2.28 14.78 249.24 |  2.34 14.78 249.24 |  2.38 14.80 249.24
DESWH                    3.34 14.01 249.29 |  4.04 14.04 251.01 |  6.13 14.06 253.21 |  7.83 14.23 255.20 | 10.85 14.57 257.76
Storage heaters          3.22  0.32   3.96 |  5.64  0.56   6.95 | 20.20  1.99  24.87 | 30.45  4.23  41.48 | 30.45  4.23  41.48
Elec. space heating      1.64  0.85   6.14 |  2.86  1.07   7.54 |  7.49  2.15  13.03 |  8.50  2.49  14.57 | 10.06  4.08  26.25
Chest freezer            0.61  0.74  15.94 |  0.61  0.74  15.95 |  1.39  0.92  17.20 |  1.85  1.07  18.10 |  2.55  1.24  18.96
Fridge freezer           0.91  0.39   7.56 |  0.91  0.40   7.61 |  2.19  0.95   8.67 |  2.94  1.25  10.58 |  4.07  1.61  11.69
Refrigerator             0.44  0.22   3.83 |  0.45  0.23   4.00 |  1.06  0.49   4.77 |  1.40  0.64   5.68 |  1.92  0.80   6.50
Upright freezer          0.67  0.39   8.37 |  0.67  0.39   8.42 |  1.63  0.80   9.09 |  2.16  1.03  10.99 |  2.98  1.31  11.98

Table 1: $\varepsilon_s$ of different appliances for different values of s. N = 100 and the sampling period is 10 min. 

Column headers of the eight accuracy columns could not be recovered; "?" marks a value that 
could not be recovered. 

Appliance              # of users
Lighting                        ?  2.53  2.54  5.16  3.66  1.87  1.73  1.71  2.34
Cassette / CD Player         2650  4.70  4.09  3.34  3.70  3.96  2.50  3.19  3.46
Hi-Fi                           ?  7.49  5.49  4.10  3.29  5.58  3.29  4.04  2.65
Iron                            ?  6.53  4.40  3.89  3.19  3.62  2.94  2.95  2.28
Vacuum                          ?  6.61  4.47  4.00  3.22  3.54  3.02  2.92  2.45
Fax                             ?  6.85  5.66  7.78  4.81  5.76  3.26  4.19  2.85
Personal computer               ?  5.35  4.50  5.32  4.55  4.79  3.20  4.03  3.49
Printer                         ?  6.21  5.06  5.73  4.69  4.71  3.01  4.07  3.04
TV                           2519  5.41  4.07  4.22  3.41  3.73  2.44  2.50  2.50
VCR / DVD                    2299  5.55  4.09  4.29  3.44  3.72  2.42  2.53  2.57
TV Receiver box              2413  5.58  4.09  4.27  3.42  3.71  2.37  2.53  2.58
Hob                           857  6.53  4.49  3.64  3.19  3.55  2.89  2.95  2.48
Oven                          760  6.31  4.50  3.78  3.13  3.35  2.99  2.74  2.41
Microwave                     505  6.41  4.24  3.96  3.17  3.39  2.97  2.90  2.44
Kettle                       2808  4.81  4.13  3.62  3.84  3.83  2.67  3.29  3.48
Small cooking (group)        1441  6.55  4.41  3.92  3.18  3.51  2.65  3.00  2.40
Dish washer                   434  6.32  4.46  4.57  3.39  3.28  3.00  2.71  2.19
Tumble dryer                 1018  5.79  4.15  4.32  3.37  2.23  2.56  2.03  2.57
Washing machine              2228  5.28  4.02  3.58  3.31  2.85  2.67  2.36  2.80
Washer dryer                  417  5.05  3.77  3.26  3.07  1.94  2.21  1.79  2.66
E-INST                         29  3.05  2.42  1.87  3.09  1.84  2.35  1.71  2.88
Electric shower              1039  6.10  4.31  3.76  3.12  3.47  2.96  2.89  2.36
DESWH                         510  3.70  3.41  2.54  3.14  1.22  1.73  1.54  2.46
Storage heaters                84  8.50  5.84  0.00  0.00  0.27  0.25  0.00  0.00
Elec. space heating            73  6.52  5.05  6.75  5.02  2.85  3.42  2.14  3.14
Chest freezer                 466  0.56  0.47  0.51  0.44  0.42  0.32  0.40  0.34
Fridge freezer               1954  0.49  0.42  0.28  0.30  0.34  0.29  0.34  0.33
Refrigerator                 1301  0.56  0.48  0.35  0.39  0.40  0.33  0.41  0.42
Upright freezer               866  0.55  0.46  0.35  0.37  0.38  0.30  0.37  0.36

Table 2: Inference accuracy of starting slots. N = 100, $T_p$ = 10 min, and "# of users" means 
the number of users who have the given appliance in our dataset. The accuracy ($|t' - t_s|$) is 
given in hours. 